Designing Machine Learning Systems by Chip Huyen

Author: Chip Huyen
Language: English
Format: epub
Publisher: O'Reilly Media, Inc.
Published: 2022-06-24T16:00:00+00:00


Summary

Training data still forms the foundation of modern ML algorithms. No matter how clever your algorithms might be, if your training data is bad, they won't be able to perform well. It's worth investing time and effort to curate and create training data that will enable your algorithms to learn something meaningful.

Once you have your training data, you'll want to extract features from it to train your ML models, which we'll cover in the next chapter.

1 Some readers might argue that this approach might not work with large models, as certain large models perform poorly on small datasets but work well with a lot more data. Even so, it's still important to experiment with datasets of different sizes to figure out the effect of dataset size on your model.

2 Multilabel tasks are tasks where one example can have multiple labels.
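For a concrete picture of footnote 2, here is a toy sketch (labels and class names are invented for illustration, and scikit-learn is assumed to be available): in a multilabel setup each example maps to a *set* of labels, which is typically binarized into one indicator column per class.

```python
from sklearn.preprocessing import MultiLabelBinarizer

# One example can carry several labels at once.
labels = [{"politics", "economy"}, {"sports"}, {"economy"}]

# Convert label sets into a binary indicator matrix,
# one column per class (columns in alphabetical order).
mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(labels)

print(list(mlb.classes_))  # ['economy', 'politics', 'sports']
print(Y.tolist())          # [[1, 1, 0], [0, 0, 1], [1, 0, 0]]
```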

3 If something were so obvious to label, you wouldn't need domain expertise.

4 Snorkel: Rapid Training Data Creation with Weak Supervision (Ratner et al., 2017, Proceedings of the VLDB Endowment, Vol. 11, No. 3)

5 Snorkel: Rapid Training Data Creation with Weak Supervision (Ratner et al., 2017)

6 Cross-Modal Data Programming Enables Rapid Medical Machine Learning (Dunnmon et al., 2020)

7 Combining Labeled and Unlabeled Data with Co-Training (Blum and Mitchell, 1998)

8 Realistic Evaluation of Deep Semi-Supervised Learning Algorithms (Oliver et al., NeurIPS 2018)

9 A token can be a word, a character, or part of a word.

10 BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding (Devlin et al., 2018)

11 Language Models are Few-Shot Learners (OpenAI 2020)

12 Active Learning Literature Survey (Burr Settles, 2010)

13 Queries and Concept Learning (Dana Angluin, 1988)

14 Bridging AI’s Proof-of-Concept to Production Gap (Andrew Ng, Stanford HAI 2020)

15 The Class Imbalance Problem: A Systematic Study (Nathalie Japkowicz and Shaju Stephen, 2002)

16 The Class Imbalance Problem: Significance and Strategies (Nathalie Japkowicz, 2000)

17 Facial action recognition using very deep networks for highly imbalanced class distribution (Ding et al., 2017)

18 A Review on Ensembles for the Class Imbalance Problem: Bagging-, Boosting-, and Hybrid-Based Approaches (Galar et al., 2011)

19 As of July 2021, when you use sklearn.metrics.f1_score, pos_label is set to 1 by default, but you can change it to 0 if you want 0 to be your positive label.
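A minimal sketch of the pos_label behavior described in footnote 19, using toy labels (scikit-learn is assumed to be available; exact defaults may differ across versions):

```python
from sklearn.metrics import f1_score

y_true = [0, 0, 1, 1, 1]
y_pred = [0, 1, 1, 1, 0]

# Default: class 1 is treated as the positive label.
f1_pos1 = f1_score(y_true, y_pred)
# Explicitly treat class 0 as the positive label instead.
f1_pos0 = f1_score(y_true, y_pred, pos_label=0)

print(round(f1_pos1, 3), round(f1_pos0, 3))  # 0.667 0.5
```

The two scores differ because precision and recall are computed with respect to whichever class is designated positive, which matters for imbalanced datasets where the minority class is usually the one you care about.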

20 The Relationship Between Precision-Recall and ROC Curves (Davis and Goadrich, 2006).

21 Resampling strategies for imbalanced datasets (Rafael Alencar, Kaggle 2018)

22 An Experiment with the Edited Nearest-Neighbor Rule (Ivan Tomek, IEEE 1976)

23 “Convex” here approximately means “linear”.

24 KNN Approach to Unbalanced Data Distributions: A Case Study Involving Information Extraction (Zhang and Mani, 2003)

25 Addressing the curse of imbalanced training sets: one-sided selection (Kubat and Matwin, 2000)

26 Plankton classification on imbalanced large scale database via convolutional neural networks with transfer learning (Lee et al., 2016)

27 Dynamic sampling in convolutional neural networks for imbalanced data classification (Pouyanfar et al., 2018)

28 The foundations of cost-sensitive learning (Elkan, IJCAI 2001)

29 Focal Loss for Dense Object Detection (Lin et al., 2017)

30 Focal Loss for Dense Object Detection (Lin et al., 2017)

31 ImageNet Classification with Deep Convolutional Neural Networks (Krizhevsky et al., 2012)


